So far, we have learned about the following topics. For this project, the following options will be pursued for modeling and detailed analysis.
Citi Bike is a bicycle sharing service in New York City operated by Motivate, an organization that manages some of the globe’s largest bike networks in large cities. Because bikes must be picked up and dropped off at Citi Bike docking stations, people can easily take one-way trips and drop their bike off at any of 900 stations.
With Citi Bikes being available 24 hours/day, 7 days/week, 365 days/year, NYC’s enormous population, and variable weather, we believe there are insights to be gleaned from historical rider and local weather data. There are also opportunities for improving safety by sending notifications to riders’ phones indicating riding conditions and availability.
This report will use Citi Bike rider and NYC weather data to predict daily customer behavior and describe the relative impact of different weather events on bike usage.
This document presents a group project that uses Citi Bike data from NYC. Its purpose is to share an analysis of the influence of weather on bike usage.
Citi Bike data from 2019 is used for the analysis. Specifically, a 5% sample of the full 2019 data was extracted and aggregated into daily ride usage (365 days in total). The daily data was then randomly divided into train (80%, 292 days) and test (20%, 73 days) sets for modeling, testing the accuracy of the models, and business analytics.
Ultimately, the findings will be used to identify the relationship between weather and bike usage and predict the bike usage based on chosen parameters (e.g. precipitation amount). Thereafter, some insights will be presented for the bike business.
Note that an attempt was made to add station capacity information to the dataset. However, it was not pursued because capacity could be identified for only 20% of the stations in the dataset.
NOTE: Because the raw dataset is extremely large, a separate RMD is used to process the raw data and generate the dataset used for all analysis in this RMD report.
The raw Citi Bike 2019 usage dataset was obtained from “https://s3.amazonaws.com/tripdata/index.html”. All 12 months of 2019 data were combined first. Because the combined data is extremely large (over 22 million rows), a random 5% sample was drawn, with set.seed() used to make the draw reproducible, resulting in over 1 million rows of data. A few data cleaning steps were then completed.
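The sampling step can be sketched as follows. This is a minimal illustration rather than the team's actual preprocessing code: the `trips` data frame is a small synthetic stand-in, and the seed value is an assumption, since the report does not state it.

```r
# Synthetic stand-in for the combined 2019 trip table (hypothetical columns)
trips <- data.frame(trip_id = 1:2000,
                    tripduration = rep(c(300, 600, 900, 1200), 500))

# set.seed() only fixes the random number stream; sample() draws the rows
set.seed(2019)  # assumed seed, not taken from the report
sample_rows <- sample(nrow(trips), size = round(0.05 * nrow(trips)))
trips_sample <- trips[sample_rows, ]

nrow(trips_sample)
```

Sampling row indices without replacement keeps every sampled trip unique, and rerunning with the same seed returns the same rows.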
The raw NYC weather data was obtained from the NOAA website through a Python API. The year, month, day, hour, and minute were extracted from the “Date” column.
The bike data and weather data were merged using the year, month, and day as the relational keys. The final combined dataset was then split into train (80%) and test (20%) datasets and stored in the files “train_final.csv” and “test_final.csv”. The team realized only later in the process that the train/test split should not have been done at such an early stage. However, repeating all of the steps above for the combined 5% dataset would have been very time consuming. The team therefore decided to aggregate both “train_final.csv” and “test_final.csv” at the beginning of further analysis, which is much less time consuming.
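A minimal sketch of the merge and a later-stage split is shown here, assuming both tables carry `year`, `month`, and `day` columns. The data frames, their values, and the seed are hypothetical and exist only to make the example runnable.

```r
# Hypothetical daily tables; the key columns mirror those described above
bike <- data.frame(year = 2019, month = rep(1:2, each = 5), day = rep(1:5, 2),
                   frequency = c(10, 12, 9, 14, 11, 20, 18, 22, 19, 21))
weather <- data.frame(year = 2019, month = rep(1:2, each = 5), day = rep(1:5, 2),
                      TMAX = c(3, 4, 2, 6, 5, 8, 7, 9, 8, 10),
                      PRCP = c(0, 1.5, 0, 0, 12.7, 0, 0, 2.3, 0, 0))

# Join on the shared date keys (merge() performs an inner join by default)
combined <- merge(bike, weather, by = c("year", "month", "day"))

# Split whole days into train (80%) and test (20%) after merging
set.seed(1)  # assumed seed
train_idx <- sample(nrow(combined), size = floor(0.8 * nrow(combined)))
train <- combined[train_idx, ]
test <- combined[-train_idx, ]
```

Splitting after the merge and daily aggregation means each day lands in exactly one of the two sets.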
Ensure that no additional clean up is needed after train and test datasets are extracted from the entire 2019 usage.
#train dataset
traindata$X = NULL
traindata$Date = NULL
traindata$start_station_id = as.integer(traindata$start_station_id)
colnames(traindata)
## [1] "start_month" "start_day" "start_station_id"
## [4] "usertype" "day" "age_group"
## [7] "time" "day_count" "avg_wind_speed"
## [10] "TMIN" "TMAX" "PRCP"
## [13] "SNOW" "avg_trip_duration" "avg_speed"
## [16] "frequency"
## 'data.frame': 551829 obs. of 16 variables:
## $ start_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ start_day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ start_station_id : int 749 792 89 397 199 200 778 784 206 88 ...
## $ usertype : Factor w/ 2 levels "Customer","Subscriber": 2 2 2 2 2 2 2 2 2 2 ...
## $ day : Factor w/ 7 levels "Friday","Monday",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ age_group : Factor w/ 4 levels "Adult","Elderly",..: 2 1 3 1 3 3 1 1 1 3 ...
## $ time : Factor w/ 4 levels "afternoon","evening",..: 1 3 2 2 1 3 1 1 2 2 ...
## $ day_count : int 1 1 1 1 1 1 1 1 1 1 ...
## $ avg_wind_speed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TMIN : num 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 3.9 ...
## $ TMAX : num 14.4 14.4 14.4 14.4 14.4 14.4 14.4 14.4 14.4 14.4 ...
## $ PRCP : num 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 ...
## $ SNOW : num 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_trip_duration: num 1378 820 1287 950 192 ...
## $ avg_speed : num 2.6 1.91 2.63 1.84 2.47 ...
## $ frequency : int 1 1 1 2 1 3 3 2 2 1 ...
## start_month start_day start_station_id usertype
## Min. : 1.000 Min. : 1.0 Min. : 1.0 Customer : 94173
## 1st Qu.: 4.000 1st Qu.: 8.0 1st Qu.:186.0 Subscriber:457656
## Median : 7.000 Median :15.0 Median :411.0
## Mean : 6.646 Mean :15.2 Mean :428.2
## 3rd Qu.: 9.000 3rd Qu.:22.0 3rd Qu.:710.0
## Max. :12.000 Max. :31.0 Max. :828.0
##
## day age_group time day_count
## Friday :86283 Adult :334701 afternoon:139095 Min. : 1.0
## Monday :77317 Elderly : 17920 evening :164354 1st Qu.:109.0
## Saturday :86853 Middle Age:190794 morning :163446 Median :190.0
## Sunday :71198 Teenager : 8414 night : 84934 Mean :186.1
## Thursday :77026 3rd Qu.:258.0
## Tuesday :83447 Max. :364.0
## Wednesday:69705
## avg_wind_speed TMIN TMAX PRCP
## Min. :0.000 Min. :-16.60 Min. :-9.90 Min. : 0.000
## 1st Qu.:1.200 1st Qu.: 3.90 1st Qu.:11.70 1st Qu.: 0.000
## Median :1.600 Median : 12.80 Median :21.70 Median : 0.000
## Mean :1.712 Mean : 11.35 Mean :19.35 Mean : 2.744
## 3rd Qu.:2.300 3rd Qu.: 18.90 3rd Qu.:27.20 3rd Qu.: 1.500
## Max. :5.700 Max. : 26.70 Max. :35.00 Max. :46.500
##
## SNOW avg_trip_duration avg_speed frequency
## Min. : 0.00000 Min. : 61.0 Min. :0.000 Min. : 1.000
## 1st Qu.: 0.00000 1st Qu.: 392.0 1st Qu.:1.935 1st Qu.: 1.000
## Median : 0.00000 Median : 647.3 Median :2.481 Median : 1.000
## Mean : 0.07141 Mean : 848.1 Mean :2.408 Mean : 1.451
## 3rd Qu.: 0.00000 3rd Qu.: 1080.0 3rd Qu.:2.967 3rd Qu.: 2.000
## Max. :10.20000 Max. :17859.0 Max. :8.432 Max. :23.000
##
#test dataset
testdata$X = NULL
testdata$Date = NULL
testdata$start_station_id = as.integer(testdata$start_station_id)
colnames(testdata)
## [1] "start_month" "start_day" "start_station_id"
## [4] "usertype" "day" "age_group"
## [7] "time" "day_count" "avg_wind_speed"
## [10] "TMIN" "TMAX" "PRCP"
## [13] "SNOW" "avg_trip_duration" "avg_speed"
## [16] "frequency"
## 'data.frame': 148349 obs. of 16 variables:
## $ start_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ start_day : int 6 6 6 6 6 6 6 6 6 6 ...
## $ start_station_id : int 813 530 638 821 318 65 86 345 128 788 ...
## $ usertype : Factor w/ 2 levels "Customer","Subscriber": 2 2 1 2 2 2 2 2 2 2 ...
## $ day : Factor w/ 7 levels "Friday","Monday",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ age_group : Factor w/ 4 levels "Adult","Elderly",..: 1 1 3 1 1 1 1 1 1 1 ...
## $ time : Factor w/ 4 levels "afternoon","evening",..: 1 2 1 3 3 1 1 2 3 2 ...
## $ day_count : int 6 6 6 6 6 6 6 6 6 6 ...
## $ avg_wind_speed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TMIN : num -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
## $ TMAX : num 9.4 9.4 9.4 9.4 9.4 9.4 9.4 9.4 9.4 9.4 ...
## $ PRCP : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SNOW : int 0 0 0 0 0 0 0 0 0 0 ...
## $ avg_trip_duration: num 516 685 720 532 2042 ...
## $ avg_speed : num 2 2.69 1.66 3.36 1.81 ...
## $ frequency : int 2 3 1 1 2 2 2 2 1 2 ...
## start_month start_day start_station_id usertype
## Min. : 1.000 Min. : 2.00 Min. : 1.0 Customer : 27591
## 1st Qu.: 5.000 1st Qu.: 9.00 1st Qu.:184.0 Subscriber:120758
## Median : 7.000 Median :18.00 Median :409.0
## Mean : 7.275 Mean :17.42 Mean :426.9
## 3rd Qu.:10.000 3rd Qu.:26.00 3rd Qu.:708.0
## Max. :12.000 Max. :31.00 Max. :827.0
##
## day age_group time day_count
## Friday :18062 Adult :90298 afternoon:37667 Min. : 6.0
## Monday :19895 Elderly : 4668 evening :44357 1st Qu.:147.0
## Saturday :17199 Middle Age:50931 morning :43497 Median :205.0
## Sunday :19230 Teenager : 2452 night :22828 Mean :207.4
## Thursday :22388 3rd Qu.:280.0
## Tuesday :19484 Max. :365.0
## Wednesday:32091
## avg_wind_speed TMIN TMAX PRCP
## Min. :0.00 Min. :-14.30 Min. : 0.00 Min. : 0.000
## 1st Qu.:1.30 1st Qu.: 7.80 1st Qu.:16.10 1st Qu.: 0.000
## Median :1.60 Median : 13.90 Median :21.10 Median : 0.000
## Mean :1.82 Mean : 12.96 Mean :21.09 Mean : 2.914
## 3rd Qu.:2.20 3rd Qu.: 19.40 3rd Qu.:28.30 3rd Qu.: 1.000
## Max. :5.20 Max. : 27.80 Max. :35.00 Max. :46.200
##
## SNOW avg_trip_duration avg_speed frequency
## Min. :0.000000 Min. : 61.0 Min. :0.000 Min. : 1.000
## 1st Qu.:0.000000 1st Qu.: 401.0 1st Qu.:1.911 1st Qu.: 1.000
## Median :0.000000 Median : 664.5 Median :2.466 Median : 1.000
## Mean :0.006377 Mean : 864.2 Mean :2.386 Mean : 1.479
## 3rd Qu.:0.000000 3rd Qu.: 1104.0 3rd Qu.:2.950 3rd Qu.: 2.000
## Max. :1.000000 Max. :17862.0 Max. :7.206 Max. :21.000
##
Trip duration is heavily right-skewed: most trips are short, with a fair number of outliers at the upper end. Note that the most extreme 0.5% of trips was removed (e.g. trip durations of more than one day).
Users are mainly adults (20-44) and middle-aged riders (45-64).
Subscribers make up a very high share of users (about 83% of the training rows).
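The skew check and the trimming step can be sketched as below. The durations are synthetic, and the sketch assumes the removed 0.5% is the upper tail, which the report implies but does not state outright.

```r
# Synthetic right-skewed durations in seconds (stand-in for the real trips)
set.seed(42)
duration <- round(rexp(10000, rate = 1 / 800)) + 61

# A long upper tail pulls the mean above the median
c(mean = mean(duration), median = median(duration))

# Drop the most extreme 0.5% of trips, using the 99.5th percentile as a cutoff
cutoff <- quantile(duration, probs = 0.995)
trimmed <- duration[duration <= cutoff]
```

The same mean-above-median pattern appears in the training summary above, where the mean trip duration (848.1) is well above the median (647.3).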
## Adult Elderly Middle Age Teenager
## 334701 17920 190794 8414
## Customer Subscriber
## 94173 457656
The data was reorganized into one row per day, grouped by the trip start date. The quantities of interest are the average trip duration, the average speed, and the total number of check-outs per day.
# Look for and download plyr package if it does not already exist
if (!require("plyr")) {
install.packages("plyr")
}
## Loading required package: plyr
#read in plyr package
library(plyr)
#organize data per day (avg trip duration, avg speed, total check-outs per day)
train_per_day = ddply(traindata,.(start_month,start_day, avg_wind_speed, TMIN, TMAX, PRCP, SNOW), c(function(x) mean(x$avg_trip_duration), function(x) mean(x$avg_speed), function(x) sum(x$frequency)))
colnames(train_per_day) = c("start_month","start_day","avg_wind_speed","TMIN","TMAX","PRCP","SNOW","avg_trip_duration","avg_speed","frequency")
test_per_day = ddply(testdata,.(start_month,start_day, avg_wind_speed, TMIN, TMAX, PRCP, SNOW), c(function(x) mean(x$avg_trip_duration), function(x) mean(x$avg_speed), function(x) sum(x$frequency)))
colnames(test_per_day) = c("start_month","start_day","avg_wind_speed","TMIN","TMAX","PRCP","SNOW","avg_trip_duration","avg_speed","frequency")
head(train_per_day)
## start_month start_day avg_wind_speed TMIN TMAX PRCP SNOW avg_trip_duration
## 1 1 1 0 3.9 14.4 1.5 0 917.5800
## 2 1 2 0 1.7 4.4 0.0 0 729.0894
## 3 1 3 0 2.8 6.7 0.0 0 724.9942
## 4 1 4 0 1.7 8.3 0.0 0 739.6375
## 5 1 5 0 5.0 8.3 12.7 0 659.4482
## 6 1 7 0 -3.8 1.1 0.0 0 670.9264
## avg_speed frequency
## 1 2.238813 1136
## 2 2.521008 1866
## 3 2.571871 2065
## 4 2.543094 2163
## 5 2.518628 846
## 6 2.626848 1860
## start_month start_day avg_wind_speed TMIN
## Min. : 1.000 Min. : 1.00 Min. :0.000 Min. :-16.600
## 1st Qu.: 3.000 1st Qu.: 8.00 1st Qu.:0.900 1st Qu.: 1.100
## Median : 6.000 Median :15.00 Median :1.600 Median : 9.400
## Mean : 6.387 Mean :15.23 Mean :1.676 Mean : 8.874
## 3rd Qu.: 9.000 3rd Qu.:22.00 3rd Qu.:2.400 3rd Qu.: 17.200
## Max. :12.000 Max. :31.00 Max. :5.700 Max. : 26.700
## TMAX PRCP SNOW avg_trip_duration
## Min. :-9.90 Min. : 0.000 Min. : 0.0000 Min. : 542.2
## 1st Qu.: 7.80 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 706.2
## Median :17.20 Median : 0.000 Median : 0.0000 Median : 799.0
## Mean :16.52 Mean : 3.647 Mean : 0.1414 Mean : 814.8
## 3rd Qu.:25.60 3rd Qu.: 2.500 3rd Qu.: 0.0000 3rd Qu.: 888.3
## Max. :35.00 Max. :46.500 Max. :10.2000 Max. :1168.4
## avg_speed frequency
## Min. :2.040 Min. : 522
## 1st Qu.:2.363 1st Qu.:1842
## Median :2.452 Median :2819
## Mean :2.437 Mean :2742
## 3rd Qu.:2.545 3rd Qu.:3684
## Max. :2.866 Max. :4689
## start_month start_day avg_wind_speed TMIN TMAX PRCP SNOW avg_trip_duration
## 1 1 6 0 -0.5 9.4 0.0 0 733.9794
## 2 1 28 0 -3.8 3.3 0.0 0 650.7955
## 3 1 29 0 -3.8 6.1 5.8 0 658.9985
## 4 1 30 0 -14.3 1.7 0.3 1 636.5608
## 5 2 4 0 5.0 16.1 0.0 0 754.8000
## 6 2 18 0 -3.2 5.6 2.3 0 702.6888
## avg_speed frequency
## 1 2.459605 1607
## 2 2.650070 1684
## 3 2.647706 1594
## 4 2.508030 1212
## 5 2.544449 2271
## 6 2.535576 1272
## start_month start_day avg_wind_speed TMIN
## Min. : 1.000 Min. : 2.00 Min. :0.000 Min. :-14.30
## 1st Qu.: 5.000 1st Qu.: 9.00 1st Qu.:1.200 1st Qu.: 5.60
## Median : 7.000 Median :18.00 Median :1.600 Median : 12.80
## Mean : 7.082 Mean :17.68 Mean :1.737 Mean : 11.16
## 3rd Qu.:10.000 3rd Qu.:26.00 3rd Qu.:2.200 3rd Qu.: 18.30
## Max. :12.000 Max. :31.00 Max. :5.200 Max. : 27.80
## TMAX PRCP SNOW avg_trip_duration
## Min. : 0.00 Min. : 0.000 Min. :0.0000 Min. : 624.2
## 1st Qu.:12.80 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.: 754.8
## Median :20.00 Median : 0.000 Median :0.0000 Median : 831.1
## Mean :19.09 Mean : 3.877 Mean :0.0137 Mean : 839.8
## 3rd Qu.:26.70 3rd Qu.: 1.800 3rd Qu.:0.0000 3rd Qu.: 913.2
## Max. :35.00 Max. :46.200 Max. :1.0000 Max. :1142.9
## avg_speed frequency
## Min. :2.057 Min. : 713
## 1st Qu.:2.341 1st Qu.:2271
## Median :2.417 Median :3203
## Mean :2.408 Mean :3005
## 3rd Qu.:2.505 3rd Qu.:3903
## Max. :2.712 Max. :4889
#model
trip_duration_linear <- lm(avg_trip_duration ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day)
summary(trip_duration_linear)
##
## Call:
## lm(formula = avg_trip_duration ~ start_month + start_day + avg_wind_speed +
## TMIN + TMAX + PRCP + SNOW, data = train_per_day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -178.09 -48.73 -13.36 33.08 251.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 648.6231 18.4841 35.091 < 2e-16 ***
## start_month -0.4084 1.6984 -0.240 0.810
## start_day 0.2206 0.5404 0.408 0.683
## avg_wind_speed 0.1718 4.7859 0.036 0.971
## TMIN -2.0381 1.8619 -1.095 0.275
## TMAX 11.8893 1.6723 7.109 9.46e-12 ***
## PRCP -3.9023 0.6183 -6.311 1.06e-09 ***
## SNOW 7.8154 5.4484 1.434 0.153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78.51 on 284 degrees of freedom
## Multiple R-squared: 0.6541, Adjusted R-squared: 0.6456
## F-statistic: 76.73 on 7 and 284 DF, p-value: < 2.2e-16
#predict
trip_duration_pred_linear <- predict(trip_duration_linear, test_per_day)
#compare
plot(trip_duration_pred_linear, test_per_day$avg_trip_duration, main = "Linear model Prediction Assessment",
xlab = "predicted" , ylab = "actual")
abline(a = 0, b = 1, col = "red")
#compute prediction accuracy with threshold of 10%
trip_duration_accuracy_linear = ifelse(abs(trip_duration_pred_linear - test_per_day$avg_trip_duration)/test_per_day$avg_trip_duration > 0.10, 0, 1)
trip_duration_accuracy_linear
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 1 1 1 0 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
## 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
## 0 1 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1
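The 10% threshold rule used above can be wrapped in a small helper so that the same check applies to every model. The function name `within_tolerance` and the toy values are illustrative assumptions, not part of the original analysis.

```r
# Flag predictions that fall within a relative tolerance of the actual value
within_tolerance <- function(predicted, actual, tol = 0.10) {
  as.integer(abs(predicted - actual) / actual <= tol)
}

# Toy values (made up for illustration)
actual <- c(700, 800, 900, 1000)
predicted <- c(720, 900, 905, 880)

hits <- within_tolerance(predicted, actual)
accuracy <- mean(hits)  # share of predictions counted as accurate
```

Here the second and fourth predictions miss by 12.5% and 12%, so two of the four count as accurate.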
#load library
library(class)
#model
train_labels = train_per_day$avg_trip_duration
test_labels = test_per_day$avg_trip_duration
#predict
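#note: train_per_day and test_per_day still contain the outcome columns (avg_trip_duration, avg_speed, frequency), and class::knn() is a classifier that treats each distinct duration as a class, so this result may be optimistic
trip_duration_pred_knn = knn(train = train_per_day, test = test_per_day, cl = train_labels, k = 17) #sqrt(sample size) is a rule of thumb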
#compare
plot((as.numeric(levels(trip_duration_pred_knn))[trip_duration_pred_knn]), test_per_day$avg_trip_duration, main = "KNN model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(neuralnet)
#model
trip_duration_model_ann = neuralnet(formula = avg_trip_duration ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day, hidden = 3)
#predict
trip_duration_pred_ann = compute(trip_duration_model_ann, test_per_day)
trip_duration_pred_ann_result = trip_duration_pred_ann$net.result
#compare
plot(trip_duration_pred_ann_result, test_per_day$avg_trip_duration, main = "ANN model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(kernlab)
#model
trip_duration_model_svm = ksvm(avg_trip_duration ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day, kernel = "vanilladot")
## Setting default kernel parameters
#predict
trip_duration_pred_svm = predict(trip_duration_model_svm, test_per_day)
#compare
plot(trip_duration_pred_svm, test_per_day$avg_trip_duration, main = "SVM model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(C50)
#model
trip_duration_model_dt = C5.0(as.factor(avg_trip_duration) ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day)
#predict
trip_duration_pred_dt = predict(trip_duration_model_dt, test_per_day)
#compare
plot((as.numeric(levels(trip_duration_pred_dt))[trip_duration_pred_dt]), test_per_day$avg_trip_duration, main = "Decision Tree model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
#model
set.seed(12345)
trip_duration_model_rf = randomForest(as.factor(avg_trip_duration) ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day)
#predict
trip_duration_pred_rf = predict(trip_duration_model_rf, test_per_day)
#compare
plot((as.numeric(levels(trip_duration_pred_rf))[trip_duration_pred_rf]), test_per_day$avg_trip_duration, main = "Random Forest model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
trip_duration_model_comparison = matrix(0, nrow=6, ncol=3)
colnames(trip_duration_model_comparison) = c("bad_prediction","good_prediction", "accuracy")
rownames(trip_duration_model_comparison) = c("linear","knn","ann","svm","decision_tree","random_forest")
trip_duration_model_comparison[1,] = c((length(trip_duration_accuracy_linear)-sum(trip_duration_accuracy_linear)), sum(trip_duration_accuracy_linear), round(sum(trip_duration_accuracy_linear)/length(trip_duration_accuracy_linear),3))
trip_duration_model_comparison[2,] = c((length(trip_duration_accuracy_knn)-sum(trip_duration_accuracy_knn)), sum(trip_duration_accuracy_knn), round(sum(trip_duration_accuracy_knn)/length(trip_duration_accuracy_knn),3))
trip_duration_model_comparison[3,] = c((length(trip_duration_accuracy_ann)-sum(trip_duration_accuracy_ann)), sum(trip_duration_accuracy_ann), round(sum(trip_duration_accuracy_ann)/length(trip_duration_accuracy_ann),3))
trip_duration_model_comparison[4,] = c((length(trip_duration_accuracy_svm)-sum(trip_duration_accuracy_svm)), sum(trip_duration_accuracy_svm), round(sum(trip_duration_accuracy_svm)/length(trip_duration_accuracy_svm),3))
trip_duration_model_comparison[5,] = c((length(trip_duration_accuracy_dt)-sum(trip_duration_accuracy_dt)), sum(trip_duration_accuracy_dt), round(sum(trip_duration_accuracy_dt)/length(trip_duration_accuracy_dt),3))
trip_duration_model_comparison[6,] = c((length(trip_duration_accuracy_rf)-sum(trip_duration_accuracy_rf)), sum(trip_duration_accuracy_rf), round(sum(trip_duration_accuracy_rf)/length(trip_duration_accuracy_rf),3))
Of the models assessed, SVM gave the best prediction of average trip duration, with an accuracy of ~79.5%. A prediction is counted as accurate when it falls within 10% of the actual trip duration.
From the linear model, it can be concluded that there is a relationship between average trip duration and different weather parameters. For example, it was found that TMAX had a positive relationship and PRCP had a negative relationship with average trip duration. From this, it can be concluded that warmer days have longer rides and rainy days have shorter rides in general.
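As a worked example of reading these coefficients, the sketch below scales the TMAX and PRCP estimates from the linear model summary. It assumes trip duration is recorded in seconds, TMAX in degrees Celsius, and PRCP in millimetres; the report does not state the units explicitly.

```r
# Coefficients copied from the trip-duration linear model summary
coef_tmax <- 11.8893   # change in avg_trip_duration per unit of TMAX
coef_prcp <- -3.9023   # change in avg_trip_duration per unit of PRCP

# Expected change in average trip duration, holding other predictors fixed
effect_warmer <- coef_tmax * 10   # a day that is 10 degrees warmer
effect_rain <- coef_prcp * 10     # a day with 10 more units of precipitation

round(c(warmer = effect_warmer, rain = effect_rain), 1)
```

Under those unit assumptions, a day 10 degrees warmer is associated with rides about 119 seconds longer on average, and 10 mm more rain with rides about 39 seconds shorter.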
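The variables TMIN, SNOW, avg_wind_speed, start_month, and start_day showed no statistically significant effect (p > 0.05). This suggests that, once TMAX and PRCP are accounted for, trip duration is not noticeably affected by minimum temperature, snow amount, wind speed, or the month or day of the trip.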
## bad_prediction good_prediction accuracy
## linear 20 53 0.726
## knn 17 56 0.767
## ann 36 37 0.507
## svm 15 58 0.795
## decision_tree 27 46 0.630
## random_forest 27 46 0.630
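Because average trip duration is a continuous value, one alternative to the classifier-based KNN used in this report is KNN regression on scaled weather predictors only. The base-R sketch below is an illustration under stated assumptions, not the team's method: `knn_regress` is a hypothetical helper, and the data is synthetic.

```r
# KNN regression: average the target over the k closest training rows
knn_regress <- function(train_x, train_y, test_x, k) {
  # Scale both sets with the training means and standard deviations
  mu <- colMeans(train_x)
  sdev <- apply(train_x, 2, sd)
  tr <- scale(train_x, center = mu, scale = sdev)
  te <- scale(test_x, center = mu, scale = sdev)
  apply(te, 1, function(row) {
    d <- sqrt(colSums((t(tr) - row)^2))  # Euclidean distance to each training row
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Toy data (made up): duration rises with TMAX and falls with PRCP
set.seed(7)
train_x <- data.frame(TMAX = runif(100, 0, 35), PRCP = runif(100, 0, 20))
train_y <- 650 + 12 * train_x$TMAX - 4 * train_x$PRCP + rnorm(100, sd = 20)
test_x <- data.frame(TMAX = c(5, 30), PRCP = c(10, 0))

pred <- knn_regress(train_x, train_y, test_x, k = 5)
```

Keeping the outcome columns out of `train_x` and `test_x` avoids leaking the value being predicted, and scaling stops variables with large ranges from dominating the distance.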
#model
speed_model_linear = lm(avg_speed ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day)
summary(speed_model_linear)
##
## Call:
## lm(formula = avg_speed ~ start_month + start_day + avg_wind_speed +
## TMIN + TMAX + PRCP + SNOW, data = train_per_day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27389 -0.07950 0.02405 0.07763 0.22919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6593008 0.0261865 101.552 < 2e-16 ***
## start_month -0.0109588 0.0024061 -4.555 7.80e-06 ***
## start_day -0.0010170 0.0007656 -1.328 0.185169
## avg_wind_speed 0.0154658 0.0067803 2.281 0.023290 *
## TMIN 0.0032415 0.0026377 1.229 0.220122
## TMAX -0.0121248 0.0023692 -5.118 5.71e-07 ***
## PRCP 0.0029408 0.0008760 3.357 0.000895 ***
## SNOW -0.0134834 0.0077187 -1.747 0.081745 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1112 on 284 degrees of freedom
## Multiple R-squared: 0.4905, Adjusted R-squared: 0.4779
## F-statistic: 39.06 on 7 and 284 DF, p-value: < 2.2e-16
#predict
speed_pred_linear = predict(speed_model_linear, newdata = test_per_day)
#compare
plot(speed_pred_linear, test_per_day$avg_speed, main = "Linear model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
For the interaction model, interactions among the significant factors from the simple linear regression are considered.
#model
speed_model_linear_interaction = lm(avg_speed ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW + start_month * avg_wind_speed + start_month * TMAX + start_month * PRCP + avg_wind_speed * TMAX + avg_wind_speed * PRCP + TMAX * PRCP, data = train_per_day)
summary(speed_model_linear_interaction)
##
## Call:
## lm(formula = avg_speed ~ start_month + start_day + avg_wind_speed +
## TMIN + TMAX + PRCP + SNOW + start_month * avg_wind_speed +
## start_month * TMAX + start_month * PRCP + avg_wind_speed *
## TMAX + avg_wind_speed * PRCP + TMAX * PRCP, data = train_per_day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26712 -0.06803 0.02498 0.07633 0.20333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.644e+00 2.921e-02 90.526 < 2e-16 ***
## start_month -1.480e-02 5.820e-03 -2.542 0.011557 *
## start_day -1.013e-03 7.553e-04 -1.342 0.180816
## avg_wind_speed 7.479e-02 1.783e-02 4.196 3.66e-05 ***
## TMIN 2.017e-03 2.727e-03 0.740 0.460051
## TMAX -1.145e-02 2.812e-03 -4.070 6.13e-05 ***
## PRCP 2.404e-03 2.816e-03 0.854 0.393976
## SNOW -1.076e-02 8.326e-03 -1.293 0.197227
## start_month:avg_wind_speed -3.065e-03 1.787e-03 -1.715 0.087486 .
## start_month:TMAX 5.913e-04 3.251e-04 1.819 0.069973 .
## start_month:PRCP -3.288e-05 3.049e-04 -0.108 0.914195
## avg_wind_speed:TMAX -2.935e-03 8.791e-04 -3.338 0.000959 ***
## avg_wind_speed:PRCP -3.302e-04 8.275e-04 -0.399 0.690195
## TMAX:PRCP 7.507e-05 1.016e-04 0.739 0.460495
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1094 on 278 degrees of freedom
## Multiple R-squared: 0.5172, Adjusted R-squared: 0.4946
## F-statistic: 22.91 on 13 and 278 DF, p-value: < 2.2e-16
#load library
library(class)
#model
train_labels = train_per_day$avg_speed
test_labels = test_per_day$avg_speed
#predict
speed_pred_knn = knn(train = train_per_day, test = test_per_day, cl = train_labels, k = 17) #sqrt(sample size) is a rule of thumb
#compare
plot((as.numeric(levels(speed_pred_knn))[speed_pred_knn]), test_per_day$avg_speed, main = "KNN model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(neuralnet)
#model
speed_model_ann = neuralnet(formula = avg_speed ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day, hidden = 3)
#predict
speed_pred_ann = compute(speed_model_ann, test_per_day)
speed_pred_ann_result = speed_pred_ann$net.result
#compare
plot(speed_pred_ann_result, test_per_day$avg_speed, main = "ANN model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(kernlab)
#model
speed_model_svm = ksvm(avg_speed ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day, kernel = "vanilladot")
## Setting default kernel parameters
#predict
speed_pred_svm = predict(speed_model_svm, test_per_day)
#compare
plot(speed_pred_svm, test_per_day$avg_speed, main = "SVM model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(C50)
#model
speed_model_dt = C5.0(as.factor(avg_speed) ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day)
#predict
speed_pred_dt = predict(speed_model_dt, test_per_day)
#compare
plot((as.numeric(levels(speed_pred_dt))[speed_pred_dt]), test_per_day$avg_speed, main = "Decision Tree model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(randomForest)
#model
set.seed(12345)
speed_model_rf = randomForest(as.factor(avg_speed) ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day)
#predict
speed_pred_rf = predict(speed_model_rf, test_per_day)
#compare
plot((as.numeric(levels(speed_pred_rf))[speed_pred_rf]), test_per_day$avg_speed, main = "Random Forest model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
speed_model_comparison = matrix(0, nrow=6, ncol=3)
colnames(speed_model_comparison) = c("bad_prediction","good_prediction", "accuracy")
rownames(speed_model_comparison) = c("linear","knn","ann","svm","decision_tree","random_forest")
speed_model_comparison[1,] = c((length(speed_accuracy_linear)-sum(speed_accuracy_linear)), sum(speed_accuracy_linear), round(sum(speed_accuracy_linear)/length(speed_accuracy_linear),3))
speed_model_comparison[2,] = c((length(speed_accuracy_knn)-sum(speed_accuracy_knn)), sum(speed_accuracy_knn), round(sum(speed_accuracy_knn)/length(speed_accuracy_knn),3))
speed_model_comparison[3,] = c((length(speed_accuracy_ann)-sum(speed_accuracy_ann)), sum(speed_accuracy_ann), round(sum(speed_accuracy_ann)/length(speed_accuracy_ann),3))
speed_model_comparison[4,] = c((length(speed_accuracy_svm)-sum(speed_accuracy_svm)), sum(speed_accuracy_svm), round(sum(speed_accuracy_svm)/length(speed_accuracy_svm),3))
speed_model_comparison[5,] = c((length(speed_accuracy_dt)-sum(speed_accuracy_dt)), sum(speed_accuracy_dt), round(sum(speed_accuracy_dt)/length(speed_accuracy_dt),3))
speed_model_comparison[6,] = c((length(speed_accuracy_rf)-sum(speed_accuracy_rf)), sum(speed_accuracy_rf), round(sum(speed_accuracy_rf)/length(speed_accuracy_rf),3))
Of the models assessed, knn is best at predicting the average speed, with an accuracy of ~94.5%.
From the linear model, there is a relationship between average bike speed and weather parameters. TMAX had a negative relationship with average bike speed, while wind speed and PRCP had positive relationships. This suggests that worse weather (colder, windier, wetter) is associated with faster riding. Also, among the significant factors from the simple linear regression model, the interaction model found a negative interaction between average wind speed and TMAX: the positive effect of wind speed on bike speed weakens as the temperature rises.
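The wind speed and TMAX interaction can be read as a slope that changes with conditions. The sketch below uses the estimates from the interaction model summary; the helper `wind_slope` and the example conditions (month 6, no rain) are illustrative assumptions.

```r
# Coefficients copied from the interaction model summary
b_wind <- 7.479e-02         # avg_wind_speed main effect
b_wind_month <- -3.065e-03  # start_month:avg_wind_speed
b_wind_tmax <- -2.935e-03   # avg_wind_speed:TMAX
b_wind_prcp <- -3.302e-04   # avg_wind_speed:PRCP

# Change in predicted avg_speed per unit of wind speed under given conditions
wind_slope <- function(tmax, month, prcp = 0) {
  b_wind + b_wind_month * month + b_wind_tmax * tmax + b_wind_prcp * prcp
}

cold_day <- wind_slope(tmax = 0, month = 6)
hot_day <- wind_slope(tmax = 30, month = 6)
```

With these estimates the wind slope is positive on a cold day and turns negative on a hot one, which is what a negative interaction term means here.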
Additionally, start month is negatively related to average bike speed: later months are associated with slower riding. Because the month number is not a monotonic measure of temperature (months 1 and 12 are both cold), this effect should not be read simply as a weather effect; it may reflect a trend over the year that the weather variables do not capture.
## bad_prediction good_prediction accuracy
## linear 5 68 0.932
## knn 4 69 0.945
## ann 10 63 0.863
## svm 7 66 0.904
## decision_tree 13 60 0.822
## random_forest 8 65 0.890
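Average daily speed varies only within a narrow band (about 2.04 to 2.87 in the training summary), so a 10% threshold is comparatively easy to meet for this target. Threshold-free error metrics can help separate the models more finely. The helpers and toy values below are illustrative assumptions, not results from this analysis.

```r
# Error metrics that do not depend on a pass/fail threshold
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))
mae <- function(predicted, actual) mean(abs(predicted - actual))

# Toy speeds (made up for illustration)
actual <- c(2.4, 2.5, 2.6, 2.3)
predicted <- c(2.5, 2.5, 2.4, 2.3)

c(rmse = rmse(predicted, actual), mae = mae(predicted, actual))
```

RMSE penalizes large misses more heavily than MAE, so reporting both gives a sense of whether errors are spread evenly or driven by a few days.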
frequency_linear <- lm(frequency ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day)
summary(frequency_linear)
##
## Call:
## lm(formula = frequency ~ start_month + start_day + avg_wind_speed +
## TMIN + TMAX + PRCP + SNOW, data = train_per_day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2040.10 -277.71 66.89 315.36 1231.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1739.344 123.646 14.067 < 2e-16 ***
## start_month 6.964 11.361 0.613 0.54040
## start_day -9.530 3.615 -2.636 0.00885 **
## avg_wind_speed 3.152 32.015 0.098 0.92164
## TMIN 20.811 12.455 1.671 0.09583 .
## TMAX 66.148 11.187 5.913 9.62e-09 ***
## PRCP -47.821 4.136 -11.561 < 2e-16 ***
## SNOW -37.774 36.446 -1.036 0.30087
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 525.2 on 284 degrees of freedom
## Multiple R-squared: 0.7712, Adjusted R-squared: 0.7656
## F-statistic: 136.8 on 7 and 284 DF, p-value: < 2.2e-16
#predict
frequency_pred_linear <- predict(frequency_linear, test_per_day)
#compare
plot(frequency_pred_linear, test_per_day$frequency, main = "Linear model Prediction Assessment",
xlab = "predicted" , ylab = "actual")
abline(a = 0, b = 1, col = "red")
#load library
library(class)
#model
train_labels = train_per_day$frequency
test_labels = test_per_day$frequency
#predict
frequency_pred_knn = knn(train = train_per_day, test = test_per_day, cl = train_labels, k = 17) #sqrt(sample size) is a rule of thumb
#compare
plot((as.numeric(levels(frequency_pred_knn))[frequency_pred_knn]), test_per_day$frequency, main = "KNN model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(neuralnet)
#model
frequency_model_ann = neuralnet(formula = frequency ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day, hidden = 3)
#predict
frequency_pred_ann = compute(frequency_model_ann, test_per_day)
frequency_pred_ann_result = frequency_pred_ann$net.result
#compare
plot(frequency_pred_ann_result, test_per_day$frequency, main = "ANN model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(kernlab)
#model
frequency_model_svm = ksvm(frequency ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day, kernel = "vanilladot")
## Setting default kernel parameters
#predict
frequency_pred_svm = predict(frequency_model_svm, test_per_day)
#compare
plot(frequency_pred_svm, test_per_day$frequency, main = "SVM model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(C50)
#model
frequency_model_dt = C5.0(as.factor(frequency) ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day)
#predict
frequency_pred_dt = predict(frequency_model_dt, test_per_day)
#compare
plot((as.numeric(levels(frequency_pred_dt))[frequency_pred_dt]), test_per_day$frequency, main = "Decision Tree model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(randomForest)
#model
set.seed(12345)
frequency_model_rf = randomForest(as.factor(frequency) ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day)
#predict
frequency_pred_rf = predict(frequency_model_rf, test_per_day)
#compare
plot((as.numeric(levels(frequency_pred_rf))[frequency_pred_rf]), test_per_day$frequency, main = "Random Forest model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#build the model comparison table
frequency_model_comparison = matrix(0, nrow=6, ncol=3)
colnames(frequency_model_comparison) = c("bad_prediction","good_prediction", "accuracy")
rownames(frequency_model_comparison) = c("linear","knn","ann","svm","decision_tree","random_forest")
frequency_model_comparison[1,] = c((length(frequency_accuracy_linear)-sum(frequency_accuracy_linear)), sum(frequency_accuracy_linear), round(sum(frequency_accuracy_linear)/length(frequency_accuracy_linear),3))
frequency_model_comparison[2,] = c((length(frequency_accuracy_knn)-sum(frequency_accuracy_knn)), sum(frequency_accuracy_knn), round(sum(frequency_accuracy_knn)/length(frequency_accuracy_knn),3))
frequency_model_comparison[3,] = c((length(frequency_accuracy_ann)-sum(frequency_accuracy_ann)), sum(frequency_accuracy_ann), round(sum(frequency_accuracy_ann)/length(frequency_accuracy_ann),3))
frequency_model_comparison[4,] = c((length(frequency_accuracy_svm)-sum(frequency_accuracy_svm)), sum(frequency_accuracy_svm), round(sum(frequency_accuracy_svm)/length(frequency_accuracy_svm),3))
frequency_model_comparison[5,] = c((length(frequency_accuracy_dt)-sum(frequency_accuracy_dt)), sum(frequency_accuracy_dt), round(sum(frequency_accuracy_dt)/length(frequency_accuracy_dt),3))
frequency_model_comparison[6,] = c((length(frequency_accuracy_rf)-sum(frequency_accuracy_rf)), sum(frequency_accuracy_rf), round(sum(frequency_accuracy_rf)/length(frequency_accuracy_rf),3))
From the results shown below, KNN has the best accuracy for predicting check-outs, at 0.973 (71 of 73 test days).
In addition, the linear model analysis indicates a relationship between the number of check-outs and the weather parameters: TMAX has a positive relationship with the number of check-outs, while PRCP has a negative one.
##               bad_prediction good_prediction accuracy
## linear                    45              28    0.384
## knn                        2              71    0.973
## ann                       60              13    0.178
## svm                       42              31    0.425
## decision_tree             50              23    0.315
## random_forest             40              33    0.452
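The six near-identical assignments that fill the comparison matrix could also be produced by one small helper. The sketch below is only an illustration: `summarise_accuracy`, `flags_by_model`, and the two toy flag vectors are hypothetical names and made-up values, not part of the original analysis. It assumes each model's accuracy vector holds 0/1 flags, as in the report.

```r
# Hypothetical helper: turn a named list of 0/1 accuracy flags into a table
# with the same three columns used in the report.
summarise_accuracy <- function(flags_by_model) {
  t(sapply(flags_by_model, function(flags) {
    c(bad_prediction  = sum(flags == 0),
      good_prediction = sum(flags == 1),
      accuracy        = round(mean(flags), 3))
  }))
}

# Made-up flags for two placeholder models, for illustration only
flags_by_model <- list(
  model_a = c(1, 1, 0, 1),
  model_b = c(0, 1, 0, 1)
)
comparison <- summarise_accuracy(flags_by_model)
comparison
```

In the report's setting, the list would hold the existing vectors (for example `linear = frequency_accuracy_linear`), so the row names come from the list names rather than a separate `rownames()` call.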
# Calculating average travel distance
train_per_day$avg_distance <- train_per_day$avg_speed * train_per_day$avg_trip_duration
test_per_day$avg_distance <- test_per_day$avg_speed * test_per_day$avg_trip_duration
#model
distance_linear <- lm(avg_distance ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day)
summary(distance_linear)
##
## Call:
## lm(formula = avg_distance ~ start_month + start_day + avg_wind_speed +
##     TMIN + TMAX + PRCP + SNOW, data = train_per_day)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -315.47  -71.50   -4.91   52.84  388.18
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    1745.5345    25.6925  67.939  < 2e-16 ***
## start_month      -8.7316     2.3607  -3.699  0.00026 ***
## start_day        -0.2766     0.7512  -0.368  0.71296
## avg_wind_speed   11.5331     6.6524   1.734  0.08406 .
## TMIN             -2.1249     2.5880  -0.821  0.41229
## TMAX             18.4861     2.3245   7.953 4.35e-14 ***
## PRCP             -6.8540     0.8595  -7.975 3.77e-14 ***
## SNOW              8.0064     7.5731   1.057  0.29131
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 109.1 on 284 degrees of freedom
## Multiple R-squared:  0.7217, Adjusted R-squared:  0.7149
## F-statistic: 105.2 on 7 and 284 DF,  p-value: < 2.2e-16
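To make the coefficient table easier to read, the sketch below applies the fitted coefficients by hand. The coefficient values are copied from the summary output; the example day (`day`) is hypothetical, and its values are in whatever units the dataset uses for each column, which this section does not state.

```r
# Coefficients copied from the lm summary above
coefs <- c("(Intercept)" = 1745.5345, start_month = -8.7316, start_day = -0.2766,
           avg_wind_speed = 11.5331, TMIN = -2.1249, TMAX = 18.4861,
           PRCP = -6.8540, SNOW = 8.0064)

# Hypothetical day, for illustration only (units as in the dataset)
day <- c("(Intercept)" = 1, start_month = 7, start_day = 15,
         avg_wind_speed = 3, TMIN = 20, TMAX = 30, PRCP = 0, SNOW = 0)

# A linear model's prediction is the sum of coefficient * predictor value
predicted_distance <- sum(coefs * day[names(coefs)])

# Change in predicted avg_distance for 10 more units of PRCP or TMAX,
# holding every other predictor fixed
delta_prcp <- unname(10 * coefs["PRCP"])
delta_tmax <- unname(10 * coefs["TMAX"])
```

With the fitted model object available, `predict(distance_linear, newdata)` gives the same kind of prediction directly; the hand calculation only shows where the number comes from.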
#predict
distance_pred_linear <- predict(distance_linear, test_per_day)
#compare
plot(distance_pred_linear, test_per_day$avg_distance, main = "Linear Model For Distance",
xlab = "predicted" , ylab = "actual")
abline(a = 0, b = 1, col = "red")
#compute prediction accuracy with threshold of 10%
distance_accuracy_linear = ifelse(abs(distance_pred_linear - test_per_day$avg_distance)/test_per_day$avg_distance > 0.10, 0, 1)
distance_accuracy_linear
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
##  1  1  1  0  1  1  1  1  1  1  1  1  0  1  1  1  1  1  0  1  1  1  1  1  1  1
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
##  1  1  1  1  1  1  1  1  1  1  0  1  1  1  1  1  1  1  1  1  1  1  1  1  0  1
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
##  1  1  1  1  1  1  1  0  1  1  1  0  1  1  1  1  1  1  1  1  1
#load library
library(class)
#model
train_labels = train_per_day$avg_distance
test_labels = test_per_day$avg_distance
#predict
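# k = 17 is roughly sqrt(292 training days); sqrt(sample size) is a rule of thumb
# note: train_per_day and test_per_day are passed whole, so the target column is among the features used for the neighbor distances
distance_pred_knn = knn(train = train_per_day, test = test_per_day, cl = train_labels, k = 17)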
#compare
plot((as.numeric(levels(distance_pred_knn))[distance_pred_knn]), test_per_day$avg_distance, main = "KNN model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(neuralnet)
#model
distance_model_ann = neuralnet(formula = avg_distance ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day, hidden = 3)
#predict
distance_pred_ann = compute(distance_model_ann, test_per_day)
distance_pred_ann_result = distance_pred_ann$net.result
#compare
plot(distance_pred_ann_result, test_per_day$avg_distance, main = "ANN model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(kernlab)
#model
distance_model_svm = ksvm(avg_distance ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day, kernel = "vanilladot")
##  Setting default kernel parameters
#predict
distance_pred_svm = predict(distance_model_svm, test_per_day)
#compare
plot(distance_pred_svm, test_per_day$avg_distance, main = "SVM model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(C50)
#model
distance_model_dt = C5.0(as.factor(avg_distance) ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day)
#predict
distance_pred_dt = predict(distance_model_dt, test_per_day)
#compare
plot((as.numeric(levels(distance_pred_dt))[distance_pred_dt]), test_per_day$avg_distance, main = "Decision Tree model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#load library
library(randomForest)
#model
set.seed(12345)
distance_model_rf = randomForest(as.factor(avg_distance) ~ start_month + start_day + avg_wind_speed + TMIN + TMAX + PRCP + SNOW, data = train_per_day)
#predict
distance_pred_rf = predict(distance_model_rf, test_per_day)
#compare
plot((as.numeric(levels(distance_pred_rf))[distance_pred_rf]), test_per_day$avg_distance, main = "Random Forest model Prediction Assessment", xlab="predicted", ylab="actual")
abline(a = 0, b = 1, col = "red")
#build the model comparison table
distance_model_comparison = matrix(0, nrow=6, ncol=3)
colnames(distance_model_comparison) = c("bad_prediction","good_prediction", "accuracy")
rownames(distance_model_comparison) = c("linear","knn","ann","svm","decision_tree","random_forest")
distance_model_comparison[1,] = c((length(distance_accuracy_linear)-sum(distance_accuracy_linear)), sum(distance_accuracy_linear), round(sum(distance_accuracy_linear)/length(distance_accuracy_linear),3))
distance_model_comparison[2,] = c((length(distance_accuracy_knn)-sum(distance_accuracy_knn)), sum(distance_accuracy_knn), round(sum(distance_accuracy_knn)/length(distance_accuracy_knn),3))
distance_model_comparison[3,] = c((length(distance_accuracy_ann)-sum(distance_accuracy_ann)), sum(distance_accuracy_ann), round(sum(distance_accuracy_ann)/length(distance_accuracy_ann),3))
distance_model_comparison[4,] = c((length(distance_accuracy_svm)-sum(distance_accuracy_svm)), sum(distance_accuracy_svm), round(sum(distance_accuracy_svm)/length(distance_accuracy_svm),3))
distance_model_comparison[5,] = c((length(distance_accuracy_dt)-sum(distance_accuracy_dt)), sum(distance_accuracy_dt), round(sum(distance_accuracy_dt)/length(distance_accuracy_dt),3))
distance_model_comparison[6,] = c((length(distance_accuracy_rf)-sum(distance_accuracy_rf)), sum(distance_accuracy_rf), round(sum(distance_accuracy_rf)/length(distance_accuracy_rf),3))
As with the previous results, KNN was the best model, with a prediction accuracy of about 97% (71 of 73 test days) at an error threshold of 10%.
The linear model shows statistically significant relationships (p < 0.001) between the average distance travelled in a day and start month, maximum temperature, and precipitation.
Ride distances increased with maximum daily temperature, dropped with precipitation, and decreased with later calendar months. It is curious that ride distances decreased over the course of the calendar year: one would expect distance to increase from January (winter weather) to July (summer weather) and then decrease through December (return to winter weather). One likely reason is that a single linear term for month cannot represent a rise followed by a fall, so with TMAX already in the model the start_month coefficient reflects only the remaining net trend.
Because of the inherent seasonality and human factors in the data, linear regression paints only a rough picture, while other models such as KNN provide more accurate predictions.
##               bad_prediction good_prediction accuracy
## linear                     7              66    0.904
## knn                        2              71    0.973
## ann                       20              53    0.726
## svm                        7              66    0.904
## decision_tree             16              57    0.781
## random_forest              8              65    0.890
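The seasonality remark above (distance expected to rise toward summer and fall again toward winter) can be illustrated with a toy example. Everything in this sketch is synthetic and hypothetical: the data are generated to peak mid-year and are not Citi Bike data. It compares a model with month as one numeric term, as in the report's formula, against a model that treats month as a factor with one level per month.

```r
# Synthetic, illustrative data only: a quantity that peaks mid-year
set.seed(1)
month <- rep(1:12, each = 5)
y     <- 100 - (month - 6.5)^2 + rnorm(length(month), sd = 2)

fit_linear <- lm(y ~ month)          # one slope for the whole year
fit_factor <- lm(y ~ factor(month))  # a separate level for each month

# The factor model can follow the rise and fall; the single slope cannot
r2_linear <- summary(fit_linear)$r.squared
r2_factor <- summary(fit_factor)$r.squared
c(linear = r2_linear, factor = r2_factor)
```

Whether a factor or another seasonal encoding would improve the models in this report is not something the sketch can show; it only demonstrates why a single numeric month term may hide a mid-year peak.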
To be filled.